This is a short note that originated from the introduction to my PhD thesis. We will empirically compare the annealing scheme implied by diffusion models to the classical simulated annealing scheme.
To sample from a measure $\mu_\text{data}$ with density $p$, Langevin dynamics runs the SDE $$ \mathrm{d} X_t = \nabla \log p(X_t) \mathrm{d} t + \sqrt{2} \mathrm{d}W_t, $$ which converges to the target distribution $p$ as $t\to\infty$.
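For concreteness, here is a minimal sketch of such a sampler with an Euler-Maruyama discretization; the function name, step size, and number of steps are my own choices, and `score` stands for $\nabla \log p$.

```python
import numpy as np

def langevin(score, x0, n_steps=10_000, dt=0.03):
    """Euler-Maruyama discretization of dX = grad log p(X) dt + sqrt(2) dW.
    `x0` is an array of initial positions, one per independent chain."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + score(x) * dt + np.sqrt(2 * dt) * np.random.randn(*x.shape)
    return x
```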
However, if $p$ is complex (e.g. multimodal), the dynamics can take a long time to converge. Therefore, people employ a technique called simulated annealing: one replaces $p$ by a path of measures $p_t$ which converges to $p$ as $t \to \infty$, and runs $$ \mathrm{d} X_t = \nabla \log p_t(X_t) \mathrm{d} t + \sqrt{2} \mathrm{d}W_t. $$ If $p_t$ is relatively simple for small $t$, the Langevin dynamics have an easier time finding the modes of the distribution. Then, if $p_t$ changes only rarely, or slowly over time, the dynamics can gradually move into the right areas of the state space. A common annealing scheme is $$ p_t = p^{\beta_t}, $$ where $\beta: [0, \infty) \to [0, 1]$, $t \mapsto \beta_t$, is an increasing function starting close to $0$ and going to $1$ as $t \to \infty$.
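The only change to the sampler above is a rescaling of the score, since $\nabla \log p_t = \beta_t \nabla \log p$ for the tempering scheme $p_t = p^{\beta_t}$. A minimal sketch, where the schedule `beta` is whichever increasing function one picks:

```python
import numpy as np

def annealed_langevin(score, beta, x0, n_steps=10_000, dt=0.03):
    """Langevin dynamics targeting p_t = p^{beta_t}; its score is beta_t * grad log p."""
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        b = beta(k * dt)  # increasing schedule with values in [0, 1]
        x = x + b * score(x) * dt + np.sqrt(2 * dt) * np.random.randn(*x.shape)
    return x
```

For example, `beta = lambda t: min(t / 300.0, 1.0)` gives the linear schedule used further below.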
The reverse SDE in score-based generative models has the form $$ \mathrm{d}X_t = \frac{1}{2} X_t \mathrm{d} t + \nabla \log p_{T-t}(X_t) \mathrm{d}t + \mathrm{d}W_t, $$ where $p_{t}$ are the marginals of the forward process. The forward process is an Ornstein-Uhlenbeck process started in the target distribution $p$, and therefore the $p_t$ are smoothed versions of $p$. If we ignore the $\frac{1}{2}X_t$ term, this already looks very close to annealed Langevin dynamics. Furthermore, annealed Langevin dynamics were one of the inspirations for Yang Song when he rediscovered diffusion models and made the connection to stochastic differential equations, see https://yang-song.net/blog/2021/score/.
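For comparison, an Euler-Maruyama discretization of this reverse SDE would look as follows; `score_t(s, x)` stands for $\nabla \log p_s(x)$ of the forward marginals, which in practice is replaced by a learned approximation.

```python
import numpy as np

def reverse_sde(score_t, xT, T=10.0, n_steps=1_000):
    """Simulate dX = (X/2 + grad log p_{T-t}(X)) dt + dW from t = 0 to T,
    started in xT (ideally a sample from the forward marginal p_T)."""
    dt = T / n_steps
    x = np.array(xT, dtype=float)
    for k in range(n_steps):
        s = T - k * dt  # forward time corresponding to reverse time k*dt
        x = x + (0.5 * x + score_t(s, x)) * dt + np.sqrt(dt) * np.random.randn(*x.shape)
    return x
```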
I have often wondered whether one can compare the diffusion model algorithm to MCMC schemes. In particular, I find it interesting to ask why they can sample from such complicated distributions. However, the comparison is not trivial: although both algorithm classes have the same goal of sampling from a distribution, in MCMC one is given an unnormalized density, whereas in diffusion models one is given samples. By viewing diffusion models as an annealing scheme for Langevin dynamics, however, one can use them to sample from densities.
The problem is that $p_t$ is not easy to calculate. Since it is the marginal of the forward process, it has the form (up to normalization) $$ p_t(x) = \int \exp\left(-\frac{1}{2}\frac{\|x - e^{-t/2}x_0\|^2}{1 - e^{-t}}\right)~p(x_0)~\mathrm{d}x_0. $$ The exact form is not actually relevant. What is important is that it requires integrating against $p_0 = p$, which we cannot do, since that is exactly the distribution we are trying to approximate.
However, for a Gaussian mixture distribution, one can calculate the diffusion model annealing scheme in closed form. Therefore, we can easily compare the classical annealing scheme to the diffusion model annealing scheme empirically. To that end, we choose a Gaussian mixture with $3$ components, each having the same weight: $$ \mu = \text{Uniform Mixture}\left(~\mathcal{N}(-20, 0.1), ~\mathcal{N}(0, 0.4), ~\mathcal{N}(20, 0.1)~\right) $$
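This target is simple enough that its score can be written down directly. A sketch (I read the second parameter of $\mathcal{N}(\cdot,\cdot)$ as a variance; if it is meant as a standard deviation, only the `VARS` array changes):

```python
import numpy as np
from scipy.special import logsumexp

MEANS   = np.array([-20.0, 0.0, 20.0])
VARS    = np.array([0.1, 0.4, 0.1])
WEIGHTS = np.array([1 / 3, 1 / 3, 1 / 3])

def _log_components(x, means=MEANS, vars_=VARS):
    """Log of weight_i * N(x; mean_i, var_i) for every component."""
    return (np.log(WEIGHTS) - 0.5 * (x[..., None] - means) ** 2 / vars_
            - 0.5 * np.log(2 * np.pi * vars_))

def gmm_logpdf(x):
    return logsumexp(_log_components(x), axis=-1)

def gmm_score(x):
    """grad log p(x) = sum_i responsibility_i(x) * (mean_i - x) / var_i."""
    log_c = _log_components(x)
    resp = np.exp(log_c - logsumexp(log_c, axis=-1, keepdims=True))
    return np.sum(resp * (MEANS - x[..., None]) / VARS, axis=-1)
```

With `gmm_score` in hand, `langevin(gmm_score, np.zeros(5000))` matches the setting of the experiment below ($T = 300$, $\delta = 0.03$); the initialization at $0$ is my guess, as the note does not state it.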
The orange lines in the figure below depict the target density $p$. The blue histogram is the result of $5000$ independent runs of Langevin dynamics, each run for a time of $T=300$ with a discretization step size of $\delta = 0.03$.
To no big surprise, the Langevin dynamics cannot sample from this distribution:
We now run simulated annealing, also up to time $T=300$ with a step size of $\delta = 0.03$. We anneal with $$ p_t(x) = p(x)^{t/300}, $$ i.e. the $\beta_t$ above is linear. We again plot a blue histogram of the samples over time, while the orange line depicts $p_t$.
However, the annealed dynamics still cannot recover the modes:
For our Gaussian mixture $p$, we can explicitly solve for $p_t$ when $p_t$ is the marginal of an Ornstein-Uhlenbeck process started in $p$. We apply the same time-change that is common in diffusion models (i.e. the Ornstein-Uhlenbeck process first runs a bit slower and then faster for larger values of $t$). This results in the following annealing scheme:
This works much better! On the one hand, one could say that this is because each $p_t$ actually has global information about $p$, since it integrates over $p$; for the classical annealing scheme, only local information about $p$ is used to get $\nabla \log p_t$. On a visual level, we see that the diffusion scheme starts around $0$ and then transports the mass into the right areas, whereas the classical scheme has the same modes in the same areas from the beginning.
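To make this concrete: for the Gaussian mixture above, the Ornstein-Uhlenbeck marginals are available in closed form. If $p = \sum_i w_i\, \mathcal{N}(m_i, \sigma_i^2)$, then the marginal at forward time $s$ is $\sum_i w_i\, \mathcal{N}(e^{-s/2} m_i,\; e^{-s}\sigma_i^2 + 1 - e^{-s})$. The sketch below reuses the mixture parameters and helpers from the earlier code block; the concrete schedule $s(t)$ (starting at some `s_max` and decaying to $0$) is my stand-in for the time-change and not necessarily the one used for the plots.

```python
def ou_marginal_score(s, x):
    """Score of the OU marginal started in the mixture: each component's mean
    shrinks to e^{-s/2} m_i and its variance relaxes towards 1."""
    means = np.exp(-s / 2) * MEANS
    vars_ = np.exp(-s) * VARS + 1.0 - np.exp(-s)
    log_c = _log_components(x, means=means, vars_=vars_)
    resp = np.exp(log_c - logsumexp(log_c, axis=-1, keepdims=True))
    return np.sum(resp * (means - x[..., None]) / vars_, axis=-1)

def diffusion_annealed_langevin(x0, T=300.0, dt=0.03, s_max=10.0):
    """Annealed Langevin where p_t is the OU marginal at a decreasing time s(t)."""
    n_steps = int(T / dt)
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        s = s_max * (1.0 - k / n_steps) ** 2  # decays from s_max to ~0
        x = x + ou_marginal_score(s, x) * dt + np.sqrt(2 * dt) * np.random.randn(*x.shape)
    return x
```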
Therefore, the obvious next step is to look for a cheap annealing scheme which also has this transporting quality. To that end, we can use annealing schemes of the form $$ p_t(x) = p(\gamma_t x)^{\beta_t}. $$ Here, $\beta_t \in [0,1]$ will again be increasing as before, flattening the distribution for small $t$. The $\gamma_t$ will be decreasing and is meant to emulate the mass transportation above. I chose $\gamma_t$ and $\beta_t$ by taking $p$ to be a Gaussian with unit variance and running an Ornstein-Uhlenbeck process started in that Gaussian. The marginals will then also be Gaussian and can be written in the form $p_t(x) = p(\gamma_t x)^{\beta_t}$ for specific choices of $\gamma_t$ and $\beta_t$. I employ these choices as an annealing scheme, which now is actually implementable and as expensive as the other simulated annealing scheme introduced above.
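Matching the two sides for a unit-variance Gaussian $p = \mathcal{N}(m, 1)$ gives, up to normalization, $\gamma_t = e^{s/2}$ and $\beta_t = e^{-s}$ when $p_t$ is the OU marginal at time $s$; the note does not spell these formulas out, so take them as my reconstruction. By the chain rule, the score of $p(\gamma x)^{\beta}$ is $\beta\gamma\,(\nabla\log p)(\gamma x)$, so the scheme only needs pointwise evaluations of $\nabla \log p$:

```python
import numpy as np

def scaled_tempered_score(score, s, x):
    """Score of p_t(x) = p(gamma x)^beta with gamma = e^{s/2}, beta = e^{-s},
    the choices obtained by matching OU marginals of a unit-variance Gaussian.
    Chain rule: grad log p_t(x) = beta * gamma * (grad log p)(gamma * x)."""
    gamma, beta = np.exp(s / 2), np.exp(-s)
    return beta * gamma * score(gamma * x)

def cheap_annealed_langevin(score, x0, T=300.0, dt=0.03, s_max=10.0):
    """Annealed Langevin with the scale-and-temper scheme; the schedule s(t)
    is the same assumed quadratic decay as in the diffusion-style sketch."""
    n_steps = int(T / dt)
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        s = s_max * (1.0 - k / n_steps) ** 2
        x = x + scaled_tempered_score(score, s, x) * dt + np.sqrt(2 * dt) * np.random.randn(*x.shape)
    return x
```

Running `cheap_annealed_langevin(gmm_score, np.zeros(5000))` only requires the (unnormalized) target score, just like classical tempering.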
We then get:
This is just a first try, but it already looks better than the annealing schedule chosen before, even though it is not as good as the diffusion model scheduling. I have never worked with annealing schemes, and therefore do not know whether annealing in the above fashion is common. If anyone knows what kind of annealing schemes people use in practice (in any area) nowadays, I'd be very interested!